SaT Innovation & Analytics - Data Challenge by Jairo Ruiz Saenz

Now that we have a clean dataset, let's dive into it and find some useful insights.

Exploratory Data Analysis - High School

In [1]:
# Import of libraries used in the script
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import plotly.express as px
import plotly.graph_objects as gp
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

from utils import build_variables_dict, get_plotly_df, update_fig_layout
In [2]:
# Set display options to ensure all columns and rows are displayed when using functions like df.head()

pd.set_option('display.max_columns', None)  # Set maximum number of columns to display to None (unlimited)
pd.set_option('display.max_rows', None)  # Set maximum number of rows to display to None (unlimited)
pd.set_option('display.precision', 2)  # Set precision for float numbers to 2 decimal places
pd.set_option('display.max_colwidth', None)  # Set maximum column width to None (unlimited)
In [3]:
# Read high school clean dataset
file_path = '../../results/data/high_school_dataset_clean.csv'
# low_memory=False reads the file in one pass, which avoids the mixed-type DtypeWarning
df = pd.read_csv(file_path, sep=',', low_memory=False)
In [4]:
clean_data_df = df.copy()
clean_data_df['PERIODO'] = clean_data_df['PERIODO'].apply(lambda x: '{:.0f}'.format(x))
clean_data_df_ITESO = clean_data_df[clean_data_df['CCT_INS_PLA']=='14MMS0519C']

Let's understand ITESO's High School Profile

In [5]:
group_by_dict = { 
    'PERIODO':'year',
    'NOMBRE_INS_PLA':'institution',
    'C_MODALIDAD':'modality', 
    'C_OPCION_EDUCATIVA':'option',
    'C_TURNO':'shift'
}
variables_dict = build_variables_dict(['V692'])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)

title='Number of high school students enrolled at ITESO per year, shift, modality and option'
fig = px.histogram(
    plotly_df, x="year", y="students", 
    title=title,
    color='shift', 
    text_auto='.s',
    facet_col='modality', 
    facet_row='option', 
    barmode='group'
)
fig = update_fig_layout(fig, "high_school", title, height=800)
fig.show()
In [6]:
plotly_df
Out[6]:
year institution modality option shift students sex age type grade
0 2019 ITESO MIXTA MIXTA NOCTURNO 69 all Total total total
1 2020 ITESO MIXTA MIXTA NOCTURNO 0 all Total total total
2 2021 ITESO ESCOLARIZADA PRESENCIAL MATUTINO 192 all Total total total
3 2022 ITESO ESCOLARIZADA PRESENCIAL MATUTINO 394 all Total total total

As we can see from both the table and the graph, there have been some changes in how ITESO operates.

In 2021, it changed:

  • Modality: From a mixed modality to formal education ("Escolarizada")
  • Educational Option: From mixed to formal on-site ("Presencial")
  • Shift: From nighttime to daytime ("Matutino")

Excluding 2020, when no enrollment was reported because of COVID, enrollment has grown in every reported year: by 178% from 2019 to 2021, and by 105% from 2021 to 2022.

In [7]:
# Exclude 2020; .copy() avoids the SettingWithCopyWarning when adding a column
plotly_df_filtered = plotly_df[plotly_df['year'] != '2020'].copy()
plotly_df_filtered['Growth Rate'] = plotly_df_filtered['students'].pct_change() * 100
plotly_df_filtered
Out[7]:
year institution modality option shift students sex age type grade Growth Rate
0 2019 ITESO MIXTA MIXTA NOCTURNO 69 all Total total total NaN
2 2021 ITESO ESCOLARIZADA PRESENCIAL MATUTINO 192 all Total total total 178.26
3 2022 ITESO ESCOLARIZADA PRESENCIAL MATUTINO 394 all Total total total 105.21

Taking into account the changes made in 2021, I'll proceed to estimate the number of students for 2023, 2024 and 2025 using a linear regression fitted only on the 2021 and 2022 data.

In [8]:
X = plotly_df_filtered[1:3][['year']]
y = plotly_df_filtered[1:3]['students']

model = LinearRegression()
model.fit(X, y)
Out[8]:
LinearRegression()
In [9]:
# Predict with a DataFrame that carries the same feature name used in fit()
future_years = pd.DataFrame({'year': [2023, 2024, 2025]})
predicted_values = model.predict(future_years)
predicted_values
Out[9]:
array([ 596.,  798., 1000.])
In [10]:
data = {
    'year': [*plotly_df['year'].tolist(), '2023 *', '2024 *', '2025 *'],
    'students': [*plotly_df['students'].tolist(), 
                 int(predicted_values[0]),
                 int(predicted_values[1]),
                 int(predicted_values[2])]
}
df_plot = pd.DataFrame(data)
In [11]:
title='Number of high school students enrolled at ITESO per year'
fig = px.bar(
    df_plot, x='year', y='students',
    title=title,
    category_orders={'year': df_plot['year'].tolist()},
    text_auto='.s',
)

fig.update_layout(    
    xaxis_title='Year',
    yaxis_title='Number of Students', 
    annotations=[
        dict(
            x=0.7, y=1,
            xref='paper', yref='paper',
            text="* estimated data",
            showarrow=False, textangle=0,            
        )
    ]
)
fig = update_fig_layout(fig, "high_school", title)
fig.show()
In [12]:
# Exclude 2020; .copy() avoids the SettingWithCopyWarning when adding a column
plotly_df_filtered = df_plot[df_plot['year'] != '2020'].copy()
plotly_df_filtered['Growth Rate'] = plotly_df_filtered['students'].pct_change() * 100
plotly_df_filtered
Out[12]:
year students Growth Rate
0 2019 69 NaN
2 2021 192 178.26
3 2022 394 105.21
4 2023 * 596 51.27
5 2024 * 798 33.89
6 2025 * 1000 25.31

As I mentioned before, ITESO experienced a 105% growth rate in 2022. Using a linear regression model, I estimated the values for the next three years and also projected the growth rate. Based on the available data, I estimate that ITESO will have 1,000 enrolled students by 2025.

Keep in mind that this estimate rests on only two data points (2021 and 2022) and doesn't factor in other variables such as the potential student population, ITESO's infrastructure, and costs.
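
To make that limitation concrete: a straight line fitted to only two points passes through both of them exactly, so the projection simply repeats the 2021 to 2022 increase every year. The sketch below reproduces the figures above with plain arithmetic. The variable names are illustrative and are not part of the original notebook.

```python
# Enrollment figures taken from the table above (2021 and 2022)
observed = {2021: 192, 2022: 394}

# With only two points, the least-squares line is the line through both points
slope = (observed[2022] - observed[2021]) / (2022 - 2021)
intercept = observed[2022] - slope * 2022

projection = {year: slope * year + intercept for year in (2023, 2024, 2025)}

# Year-over-year growth implied by a constant absolute increase
series = {**observed, **projection}
years = sorted(series)
growth = {
    year: (series[year] / series[prev] - 1) * 100
    for prev, year in zip(years, years[1:])
}

for year in years[1:]:
    print(year, round(series[year]), f"{growth[year]:.2f}%")
```

Because the absolute increase is constant (202 students per year), the percentage growth necessarily shrinks each year, which is the pattern seen in the projected growth rates.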

Now let's take a look at the student distribution by age and sex

In [13]:
# Aggregating data of enrolled students
group_by_dict = {
    'PERIODO':'year', 
    'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict([f"V{i}" for i in range(402, 692 + 1)])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)

plotly_df_filtered = plotly_df
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['institution']=='ITESO']
# plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['year']=='2019']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['type']!='total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['age']!='Total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['grade']=='total']
In [14]:
# Aggregating data of enrolled students 2019
plotly_df_filtered_2019 = plotly_df_filtered[plotly_df_filtered['year']=='2019']
grouped_df = plotly_df_filtered_2019.groupby(['age', 'sex'])['students'].sum().unstack(fill_value=0)
grouped_df = grouped_df.reset_index()

y_age = grouped_df['age'] 
x_M = grouped_df['male'] 
x_F = grouped_df['female'] * -1

# Creating instance of the figure and adding data
fig_2019 = gp.Figure() 
fig_2019.add_trace(gp.Bar(y=y_age, x=x_M, name='Male', orientation='h', text=grouped_df['male']))
fig_2019.add_trace(gp.Bar(y=y_age, x=x_F, name='Female', orientation='h', text=grouped_df['female']))

title = "Age distribution of ITESO's 2019 high school student population"
fig_2019.update_layout(
    title=title,
    barmode='relative',
    bargap=0.1,
    height=600,
    yaxis_title='Age',
    xaxis=dict(
        tickvals=[-2000, -1000, -500, -200, -100, -50, -10, 0, 10, 50, 100, 200, 500, 1000, 2000],
        ticktext=['2K', '1K', '500', '200', '100', '50', '10', '0', '10', '50', '100', '200', '500', '1K', '2K'],
        title='Number of students',
    ),
)
fig_2019 = update_fig_layout(fig_2019, "high_school", title)
# fig_2019.show()
In [15]:
# Aggregating data of enrolled students 2022
plotly_df_filtered_2022 = plotly_df_filtered[plotly_df_filtered['year']=='2022']
grouped_df = plotly_df_filtered_2022.groupby(['age', 'sex'])['students'].sum().unstack(fill_value=0)
grouped_df = grouped_df.reset_index()

y_age = grouped_df['age'] 
x_M = grouped_df['male'] 
x_F = grouped_df['female'] * -1

# Creating instance of the figure and adding data
fig_2022 = gp.Figure() 
fig_2022.add_trace(gp.Bar(y=y_age, x=x_M, name='Male', orientation='h', text=grouped_df['male']))
fig_2022.add_trace(gp.Bar(y=y_age, x=x_F, name='Female', orientation='h', text=grouped_df['female']))

title = "Age distribution of ITESO's 2022 high school student population"
fig_2022.update_layout(
    title=title,
    barmode='relative',
    bargap=0.1,
    height=600,
    yaxis_title='Age',
    xaxis=dict(
        tickvals=[-2000, -1000, -500, -200, -100, -50, -10, 0, 10, 50, 100, 200, 500, 1000, 2000],
        ticktext=['2K', '1K', '500', '200', '100', '50', '10', '0', '10', '50', '100', '200', '500', '1K', '2K'],
        title='Number of students',
    ),
)
fig_2022 = update_fig_layout(fig_2022, "high_school", title)
# fig_2022.show()
In [16]:
fig_2019.show()
In [17]:
fig_2022.show()

Consistent with the move from a nighttime to a daytime shift in 2021, student demographics changed between 2019 and 2022.

In 2019 the students were older; I infer they were mainly adults who attended school after work. The daytime classes offered since 2021 suggest a population more typical of a traditional school setting, made up of school-age students.

Now let's explore the student distribution by grades

In [18]:
group_by_dict = {
    'PERIODO':'year', 
    'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict([f"V{i}" for i in range(402, 692 + 1)])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)

plotly_df_filtered = plotly_df
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['institution']=='ITESO']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['type']=='total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['age']=='Total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['grade']!='total']
In [19]:
title='Number of high school students enrolled at ITESO per year and grade'
fig = px.histogram(
    plotly_df_filtered, x="grade", y="students", 
    title=title,
    text_auto='.s',
    facet_col='year',
    category_orders={"institution": ["average"] + sorted(plotly_df['institution'].unique())},
)
fig.update_layout(yaxis_title='Number of Students') 
fig = update_fig_layout(fig, "high_school", title)
fig.show()

This is interesting. By examining the distribution and the previous graphs, we can see that ITESO High School rebranded itself and has been operating as a brand-new school since 2021.

With the changes implemented in 2021, we already observed a shift in student demographics. This new graph shows a matching redistribution across grades.

Let's explore scholarships

In [20]:
group_by_dict = {
    'PERIODO':'year', 
    'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict([f"V{i}" for i in range(200, 261 + 1)])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)

plotly_df_filtered = plotly_df
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['institution']=='ITESO']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['students']>0]
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['sex']=='all']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['scholarship_type']!='Total']
variable V230 was not found!
variable V258 was not found!
In [21]:
plotly_df_filtered['year'] = pd.to_numeric(plotly_df_filtered['year']) 
plotly_df_filtered['year'] = plotly_df_filtered['year'] - 1
plotly_df_filtered
Out[21]:
year institution students sex scholarship_type detail
612 2018 ITESO 46 all Beca de la propia institución NaN
723 2021 ITESO 177 all Beca particular NaN
In [22]:
title="High school students enrolled at ITESO with scholarship per year and scholarship type"
plotly_df_filtered = plotly_df_filtered.rename(columns={'scholarship_type': 'Scholarship Types', 'students': 'Number of Students'})
fig = px.bar(
    plotly_df_filtered, x='Scholarship Types', y='Number of Students',
    title=title,
    barmode='group',
    facet_col='year',            
    text_auto='.s'    
)
fig = update_fig_layout(fig, "high_school", title)
fig.show()

It's worth mentioning that the information reported about scholarships relates to the previous year. That's why, even though the last reported year was 2022, the most recent scholarship information we have is for 2021.

Having 177 students with scholarships in 2021 is a significant figure: with 192 students enrolled that year, it represents 92.2% of the student body.

In [23]:
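# Scholarships reported in a given year refer to the previous school year
year = '2022'
previous_year = str(int(year) - 1)
# .iloc[0] extracts the single matching value instead of calling int() on a Series
students_enrolled_with_scholarship_2021 = int(clean_data_df_ITESO.loc[clean_data_df_ITESO['PERIODO'] == year, 'V261'].iloc[0])
students_enrolled_2021 = int(clean_data_df_ITESO.loc[clean_data_df_ITESO['PERIODO'] == previous_year, 'V692'].iloc[0])
students_enrolled_without_scholarship_2021 = students_enrolled_2021 - students_enrolled_with_scholarship_2021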
data = {'status': ['Students with scholarship', 'Students without scholarship'], 'students': [students_enrolled_with_scholarship_2021, students_enrolled_without_scholarship_2021]}
df = pd.DataFrame(data)

title = 'Enrolled high school students at ITESO in 2021 with and without scholarships'
fig = px.pie(
    df, values='students', names='status',
    title=title
)
fig = update_fig_layout(fig, "high_school", title)
fig.show()

In order to evaluate the purchase of ITESO High School, we need to consider that there is no financial information available, such as tuition value to calculate income or operating costs. This limitation restricts the scope of how we can compare schools within a financial context.

Another method to evaluate school performance and rank them would be by using a standardized score, such as the SATs in the United States. In Mexico, the equivalent is the 'Examen Nacional de Ingreso a la Educación Media Superior' (EXANI-I), but unfortunately, we don't have access to this information either.

The best proxy to evaluate the school could be the actual number of enrolled students, their growth rate, and the number of students with outstanding abilities. Although the assessment of outstanding abilities may be subjective, it is the best metric available.

Before we continue, another important consideration is that high schools operate in a different context than universities or graduate schools. University and graduate students typically relocate to attend campus, whereas high school students usually do not move to a different city. Instead, families may move to a city for various reasons and then seek out the best school they can afford. Therefore, it is not appropriate to compare high schools that are located in different cities.

Let's start by calculating the number of enrolled students and their growth rate.
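
Before running this on the full dataset, the computation can be sketched on a toy frame. The school names and numbers below are made up for illustration; only the pivot, growth and filter steps mirror the approach used in this notebook. Note how a school with no students in the first year gets an infinite growth rate, which is why the `2021 > 0` condition matters.

```python
import pandas as pd

# Made-up data: school C has no 2021 record, school B is shrinking
toy = pd.DataFrame({
    "school": ["A", "A", "B", "B", "C"],
    "year": ["2021", "2022", "2021", "2022", "2022"],
    "students": [100, 150, 200, 180, 50],
})

# One row per school, one column per year (missing years become 0)
wide = (
    toy.groupby(["school", "year"])["students"]
    .sum()
    .unstack(fill_value=0)
    .reset_index()
)

# Percentage change from 2021 to 2022; dividing by 0 yields inf for school C
wide["growth_percent"] = (wide["2022"] - wide["2021"]) / wide["2021"] * 100

# Keep schools present in both years with positive growth
kept = wide[(wide["2021"] > 0) & (wide["2022"] > 0) & (wide["growth_percent"] > 0)]
print(kept)
```

Only school A survives the filter: B is dropped for negative growth and C for having no students in 2021.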

In [24]:
# Get the city where ITESO High School is located
ITESO_city = clean_data_df_ITESO['CV_MUN'].unique()[0]

# Get the data of enrolled students in 2021 and 2022
df_filtered = clean_data_df[(clean_data_df['CV_MUN'] == ITESO_city) & (clean_data_df['PERIODO'].isin(['2021', '2022']))]
plotly_df = df_filtered.groupby(['NOMBRE_INS_PLA', 'C_TURNO', 'PERIODO'])['V692'].sum().unstack(fill_value=0).reset_index()

# Calculate the growth_percent
plotly_df["growth_percent"] = ((plotly_df["2022"] - plotly_df["2021"]) / plotly_df["2021"]) * 100

# Drop schools with no enrolled students in 2021 or 2022, or with non-positive growth;
# these schools are already performing worse than ITESO and I won't consider them as competitors
plotly_df = plotly_df[(plotly_df['2021'] > 0) & (plotly_df['2022'] > 0) & (plotly_df['growth_percent']> 0)]
plotly_df_melted = plotly_df.melt(id_vars=['NOMBRE_INS_PLA', 'C_TURNO'], var_name='PERIODO', value_name='students')
plotly_df_melted = plotly_df_melted[plotly_df_melted['PERIODO'] == '2022']
plotly_df_melted.head()
Out[24]:
NOMBRE_INS_PLA C_TURNO PERIODO students
34 CECYTE EMSAD 63 TURUNDEO MATUTINO 2022 116.0
35 CECYTE EMSAD NUM.77 "FLOR BATAVIA" MATUTINO 2022 151.0
36 CENTRO DE ESTUDIOS ADMINISTRATIVOS DE OCCIDENTE VESPERTINO 2022 29.0
37 CENTRO DE ESTUDIOS UNIVERSITARIOS VERACRUZ MATUTINO 2022 249.0
38 CENTRO EDUCATIVO MARSELLA MATUTINO 2022 129.0
In [25]:
plotly_df_melted['NOMBRE_INS_PLA'].unique().shape
Out[25]:
(26,)

Great, now we can see that, besides ITESO, there are 25 high schools in Jalisco that share its municipality code and grew between 2021 and 2022. Let's find out which ones are targeting a similar group of students

In [26]:
title='High school students enrolled in 2022 per Institutions and shift located in Jalisco'
fig = px.histogram(
    plotly_df_melted, x="students", y="NOMBRE_INS_PLA", 
    title=title,
    color='C_TURNO',
    text_auto='.s'
)
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(xaxis_title='Number of Students', yaxis_title='Institution')
fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig = update_fig_layout(fig, "high_school", title)
fig.show()

We can see that by the number of enrolled students in 2022, ITESO ranks 8th out of 26. Additionally, there are schools that operate in multiple shifts, such as "Preparatoria 6", which offers "Matutino", "Vespertino", and "Discontinuo". This allows it to enroll more students.

Once again, we'll filter the data to match ITESO's profile, keeping only schools that offer the "Matutino" shift like ITESO.

As mentioned before, now that we have the number of enrolled students, I'll incorporate the number of students with outstanding abilities. This will help us classify schools that are similar and provide a better understanding of the schools. For this, I'll be using the K-Means clustering technique.
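
One point worth keeping in mind with K-Means: it groups rows by Euclidean distance, so a feature with a large numeric range (such as enrollment counts in the hundreds) tends to weigh more than features with small ranges. A common way to balance this is to standardize the features first. The sketch below uses made-up values and hypothetical column names; the clustering in this notebook runs on the raw values, so this is an optional variation rather than the method applied here.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up values, for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "students_2022": [120, 150, 400, 420, 900, 950],
    "growth_percent": [5.0, 80.0, 4.0, 60.0, 3.0, 2.0],
    "outstanding": [0, 12, 0, 3, 40, 0],
})

# Rescale every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy)

# n_init is set explicitly so the behavior does not depend on the library default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(scaled)
print(toy)
```

Whether to scale is a judgment call: scaling gives every feature the same weight, while clustering on raw values lets enrollment size drive most of the grouping.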

In [27]:
ITESO_city = clean_data_df_ITESO['CV_MUN'].unique()[0]
df_filtered = clean_data_df[clean_data_df['CV_MUN']==ITESO_city]
df_filtered = df_filtered[df_filtered['PERIODO'].isin(['2021', '2022'])]
df_filtered = df_filtered[df_filtered['C_TURNO']=='MATUTINO']

plotly_df=df_filtered.groupby(['CCT_INS_PLA', 'NOMBRE_INS_PLA', 'PERIODO'])['V692'].sum().unstack(fill_value=0).reset_index()
plotly_df["growth_percent"] = ((plotly_df["2022"] - plotly_df["2021"]) / plotly_df["2021"]) * 100
plotly_df = plotly_df[(plotly_df['2021'] > 0) & (plotly_df['2022'] > 0) & (plotly_df['growth_percent']> 0)]
plotly_df=plotly_df.reset_index(drop=True)
# plotly_df
In [28]:
temp = clean_data_df[
    (clean_data_df['CCT_INS_PLA'].isin(plotly_df['CCT_INS_PLA'])) &
    (clean_data_df['PERIODO']=='2022') &
    (clean_data_df['V692']>0)
]
temp2=temp.groupby(['CCT_INS_PLA'])[['V942', 'V945', 'V948', 'V951', 'V954']].sum().reset_index()
In [29]:
plotly_df = pd.merge(plotly_df, temp2, on='CCT_INS_PLA', how='left')
plotly_df.head()
Out[29]:
CCT_INS_PLA NOMBRE_INS_PLA 2021 2022 growth_percent V942 V945 V948 V951 V954
0 14MMS0078X COLEGIO DE BACHILLERES 5 566 589 4.06 28 54 54 64 58
1 14MMS0081K COLEGIO DE BACHILLERES 8 211 230 9.00 0 0 0 0 0
2 14MMS0278V ESCUELA PREPARATORIA SANTA MARIA TEQUEPEXPAN 246 259 5.28 0 0 0 0 0
3 14MMS0326O CENTRO DE ESTUDIOS UNIVERSITARIOS VERACRUZ 175 249 42.29 0 0 0 0 0
4 14MMS0385D INSTITUTO LIDERES DEL SIGLO 163 169 3.68 0 0 0 0 0
In [30]:
# I'll be using the number of enrolled students in 2022, the growth percent and
# the number of students with outstanding abilities to run the K-Means model
k = 5
kmeans = KMeans(n_clusters=k, random_state=0)
enrollment_data = plotly_df[["2022", 'growth_percent', 'V942', 'V945', 'V948', 'V951', 'V954']]

# Fit the model to the data
kmeans.fit(enrollment_data)

# Assign the cluster to each school
plotly_df["cluster"] = kmeans.labels_
plotly_df = plotly_df.sort_values(by='cluster', ascending=True)

cluster_colors_map = {}
for i in range(k):
    cluster_colors_map.update({i:str(i)})

colors = [cluster_colors_map.get(label, px.colors.qualitative.Plotly[0]) for label in plotly_df["cluster"]]
plotly_df['label'] = plotly_df['NOMBRE_INS_PLA'].apply(lambda x: x if x == 'ITESO' else '')

title= "School Enrollment Distribution (2021 vs. 2022) with Growth"
fig = px.scatter(
    plotly_df, x="2022", y="growth_percent",
    title=title,    
    size="growth_percent", size_max=50,
    color=colors, opacity=0.7,
    hover_name='NOMBRE_INS_PLA', hover_data={'growth_percent':':.2f','cluster':True},    
    labels={"2022": "Students in 2022", "growth_percent":"Growth percent", "cluster":"Cluster"},  
    text='label'
)
fig.update_layout(legend_title_text="Cluster" if colors else None)
fig.update_layout(xaxis_title='Number of Students in 2022', yaxis_title='Growth percentage')
fig = update_fig_layout(fig, "high_school", title)
fig.show()

We can see that ITESO was classified into a cluster along with three other schools. However, ITESO's growth is greater than the other schools'. Let's plot the same data with the new cluster information

In [31]:
title='High school students enrolled in 2022 per Institutions (clustered) located in Jalisco'
fig = px.histogram(
    plotly_df, x="2022", y="NOMBRE_INS_PLA", 
    title=title, 
    text_auto='.s', 
    color='cluster'
)
fig.update_yaxes(categoryorder="total ascending")
fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig.update_layout(xaxis_title='Number of Students', yaxis_title='Institutions')
fig = update_fig_layout(fig, "high_school", title)
fig.show()

With this new information, we see that ITESO has three close competitors in the short term. Looking further ahead, however, it will also face schools that already enroll more students.

One last graph I would like to see is one that shows the number of students with outstanding abilities by type of ability

In [32]:
ITESO_city = clean_data_df_ITESO['CV_MUN'].unique()[0]
df_filtered = clean_data_df[clean_data_df['CV_MUN']==ITESO_city]
df_filtered = df_filtered[df_filtered['PERIODO'].isin(['2021', '2022'])]
df_filtered = df_filtered[df_filtered['C_TURNO']=='MATUTINO']

plotly_df=df_filtered.groupby(['CCT_INS_PLA', 'NOMBRE_INS_PLA', 'PERIODO'])['V692'].sum().unstack(fill_value=0).reset_index()
plotly_df["growth_percent"] = ((plotly_df["2022"] - plotly_df["2021"]) / plotly_df["2021"]) * 100
plotly_df = plotly_df[(plotly_df['2021'] > 0) & (plotly_df['2022'] > 0) & (plotly_df['growth_percent']> 0)]
plotly_df=plotly_df.reset_index(drop=True)

plotly_df = clean_data_df[
    (clean_data_df['CCT_INS_PLA'].isin(plotly_df['CCT_INS_PLA'])) &
    (clean_data_df['PERIODO']=='2022') &
    (clean_data_df['V692']>0)
]
In [33]:
group_by_dict = {
    'PERIODO':'year', 
    'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict(['V942', 'V945', 'V948', 'V951', 'V954'])
plotly_df = get_plotly_df(plotly_df, variables_dict, group_by_dict)
plotly_df = plotly_df[plotly_df['students']>0]
# plotly_df
In [34]:
title='Number of students with outstanding abilities in 2022 by institution and type of ability'
fig = px.bar(
    plotly_df, x='students', y='institution',
    title=title,
    color='aptitude',
    text_auto='.s',             
)
fig.update_layout(xaxis_title='Number of Students', yaxis_title='Institution')
fig = update_fig_layout(fig, "high_school", title)
fig.show()

This graph shows something interesting. Even though ITESO is a small, new school, the data suggests that its curriculum or culture is diverse and aims to develop multiple aptitudes in its students. This is unlike other institutions, where the only aptitude reported is intellectual. I'm not saying that intellectual aptitude is unimportant, but ITESO and "Colegio de Bachilleres 5" seem to be the only institutions whose students develop multiple types of abilities.

<< High School - Data cleaning
Summary and Conclusions >>

Developed by Jairo Ruiz Saenz

V1.0 - March 10th 2024